Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use timely's logging infrastructure to log Tracker state #321

Closed
wants to merge 4 commits into from

Conversation

saradecova
Copy link

@saradecova saradecova commented Mar 5, 2020

Add TrackerEvent which records additions or removals
of capabilities, as well as propagation events when changes
in implications are propagated along the internal connections
and edges of the graph.

Add DebugEvent which records the state of pointstamps,
implications, and worklist of Tracker.

Enabling loggers for these events is done the same way as for
logging::TimelyEvent's

At the moment, we use these loggers for comparative testing between
Isabelle implementation of progress tracking and our Rust implementation.

This graph plots the runtime of the computation with and without the changes, and suggests
that the increase is not significant despite the changes being on the critical path.
The example used is timely/examples/barrier.rs with 10,000,000 samples, run
with four workers. The graph shows complementary cdf for:

  1. current master,
  2. change with all loggers disabled (almost identical to (1)),
  3. original TimelyEvent logger, and
  4. original TimelyEvent logger + TrackerEvent logger.

We got similar results when run on other examples.

Add TrackerEvent which records additons or removals
of capabilities, as well as propagation events when chanages
in implications are propagated along the internal connections
and edges of the graph.

Add DebugEvent which records the state of pointstamps,
implications, and worklist of Tracker.

Enabling loggers for these events is done the same way as for
logging::TimelyEvent's.
@@ -317,6 +317,13 @@ impl<T: PartialOrder+Ord+Clone> MutableAntichain<T> {
self.frontier().less_equal(time)
}

/// Clones the vector of updates.
/// Only used for debugging purposes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you say a bit more about what this is used for in the comments? If this lands, I'll need to support it, and it would help to understand why it is here and under which circumstances it could go away.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the function as it's no longer needed.

/// Clones the vector of updates.
/// Only used for debugging purposes.
#[inline]
pub fn updates(&self) -> Vec<(T, i64)> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this return a &[(T, i64)] instead, to avoid a mandatory allocation?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the function completely as it's no longer needed.

@@ -145,8 +147,9 @@ pub struct Builder<T: Timestamp> {
impl<T: Timestamp> Builder<T> {

/// Create a new empty topology builder.
pub fn new() -> Self {
pub fn new(path: Vec<usize>) -> Self {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a big abstraction change, in that reachability.rs used to be agnostic to the hierarchical nature of names in timely dataflow, and just did reachability tracking in a scope with no additional information. This seems to bake that in now, which .. will have to ponder whether that is a good call or not.

Another option might be to produce another implementation of reachability.rs, putting things behind a trait.

Not trying to be difficult, but attempting to minimize the complexity in an already-too-complicated bit of logic.

Copy link
Author

@saradecova saradecova Mar 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point. I removed the path argument from from Tracker and Builder constructor.

However, it still must be supplied when the logger is registered. The logger now looks like this:

pub tracker_logger: Option<(Vec<usize>, crate::logging::Logger<TrackerEvent>)>,

The logging events such as "a capability is added/removed from a location" do not make sense if they cannot be tracked down to a particular location.

Let me know what you think please and whether you would still prefer going your way.

@frankmcsherry
Copy link
Member

I have some general questions about the PR!

I think I understand the goal, which is to get out information about the steps that reachability.rs takes as it runs. What I'm less clear on is whether this is the best way to do that. I think that what's in reachability.rs is essentially deterministic once its inputs are specified. So, if we captured the timestamp changes that are supplied as inputs, and the moments at which propagation happens, I think this would be sufficient information to re-execute the code and see what happens, but without asking reachability.rs to record this for you.

Would it serve your purposes just as well to instrument the moments and nature of pointstamp updates and propagation?

@saradecova
Copy link
Author

Thank you for the comments! I changed the logging so there is only one additional logger (as apposed to two) logging the minimum information to re-execute the code as you suggested.

@utaal
Copy link
Member

utaal commented Mar 13, 2020

Hi folks, a couple of more notes from a conversation with @saradecova .

We're using Debug to convert timestamps of arbitrary types to Strings for logging:

https://github.com/TimelyDataflow/timely-dataflow/pull/321/files#diff-d233d1f563e31c39b0ad93a7182fdcb3R493

This seems reasonable (we wouldn't know how to encode the type otherwise), however this makes it harder (impossible) for the consumer to know what type to expect for a certain scope. As an example, to be able to replay the behaviour of the Tracker using a trace generated with these events, we need to parse the timestamps (to re-establish the partial order, mainly).

We're wondering if it makes sense to add (or adjust an existing) TimelyEvent to output type information for scope timestamps. This way, a consumer can determine how to parse the stringified timestamps (possibly by providing FromStr implementations).

One option would be to use TypeId (docs), which unfortunately hides its internals, and makes it hard (impossible) to parse at the receiver.

I'm still considering options, but @frankmcsherry let us know if you have opinions.

@saradecova
Copy link
Author

To follow on @utaal and our conversation, we can encode the type in string using std::any::type_name.

Moreover, by adding internal_summaries to OperatesEvent it would be also beneficial to have the timestamp type for users of TimelyEvent log-stream.

In pracrise, this could be an event informing us that "A new subscope was created at address addr from the root with associated timestamp_type:

/// The creation of a `Subgraph`.
pub struct SubgraphEvent {
    pub addr: Vec<usize>,
    pub timestamp_type: String,
}

@utaal
Copy link
Member

utaal commented Mar 13, 2020

Also, it may be the right time to address these todo(s) in ProgressEvent:

messages: Vec::new(),
internal: Vec::new(),

messages: Vec::new(),
internal: Vec::new(),

Loggs an event whenever a new instance of a Subgraph is created.
@frankmcsherry
Copy link
Member

I'm back to looking at this. Very sorry for the delay.

I have several spot comments, and generally think that before landing durably in timely it needs a bit more design work. In particular,

  1. There are some use cases close to but not yet well-served by this (most interesting one from several users: which (operator, timestamp) pairs are on the frontier). I'd like to sort these out, but that doesn't have to block this.

  2. The use of String for timestamps weirds me out. It also seems to be used for Antichain, which .. I can see how this gets you the data you need, but it can't be the best way. :D

  3. The change to OperatesEvent is substantial; I wonder if it could be normalized out in to another stream?

I still don't have a great read on the requirements here. I apologize if my comments have been confused. My understanding is that you want to be able to extract progress information from the reachability subsystem, and I'm guessing that is to drive your work on progress stuffs. Do you need the fine-grained update information, or just the aggregate information extracted in propagate_all? Do you expect the information to be helpful outside of your work (roughly: should this be a branch, or is it valuable to have all timely users have access to this)?

I'm currently trying to reconcile this with requests other information asks about progress tracking, that I think are more about "log the state of the dataflow-wide frontier". It probably relates, and definitely has the same awkwardness around timestamps being generic.

Anyhow, I'm thinking about this now, and trying to understand which things are important to log and which are optional!

@frankmcsherry
Copy link
Member

frankmcsherry commented May 8, 2020

I have another ask: is there a qualitative difference for you between logging the progress updates in update_target and update_source as you do, vs perhaps logging them in propagate_all as (or just before) they are drained out of their buffers. The latter has the benefit of being a bit less noisy (we have consolidated the updates and potentially canceled some out), but it would remove visibility in to the first moments at which the tracker knows about some updates (in case that was part of your study).

EDIT: It also has the nice property that they can be logged transactionally, all at the same timestamp, which may avoid transient weirdness for folks looking at the data (I don't believe we work hard to fold in positive updates before negative updates).

This looks like a good direction to go to expose information about the state of system progress, but I bet my requirements are not your requirements (e.g. that change above is fine for me, but I don't want to do it if it breaks your reqs).

There are some stray lines added and removed; I can fix these up. Mostly I'm trying to map out "imagine this lands; what 'improvements' should be prohibited?"

Comment on lines +177 to +184
// Perhaps log information about the creation of subgraph.
if let Some(l) = self.logging.as_mut() {
l.log(crate::logging::SubgraphEvent{
id: worker.index(),
addr: path.clone(),
timestamp_type: std::any::type_name::<TInner>().to_string(),
});
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd love to reframe this as a TrackerEvent and have it be part of the line just up above (i.e. "tracker came in to existence"). I suspect something like Tracker::install_logger(...) could do both of those things and wrap up the abstraction well. I'm happy to do that after the fact if that works for you.

Comment on lines +538 to +541
// double-check that child 0 (the outside world) is correctly shaped.
assert_eq!(self.children[0].outputs, self.inputs());
assert_eq!(self.children[0].inputs, self.outputs());

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain why these moved down 20 lines?

Comment on lines +422 to +423
let (internal_summary, _) = operator.get_internal_summary();

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If at all possible, I'd like to keep this next to the set_external_summary() call just to be clear that they are paired. I'm happy to have it hoisted as well.

Comment on lines +67 to +68
/// Internal summary for every combination of input and output port.
pub internal_summaries: Vec<Vec<String>>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain what these are used for? Would it be equally beneficial to have the report from the tracker about its input-to-output summaries? That would leave this event stable and consolidate the timestamp/summary related events to the reachability tracker.

@@ -597,7 +629,6 @@ impl<T:Timestamp> Tracker<T> {
// will discover zero-change times when we first visit them, as no further
// changes can be made to them once we complete them.
while let Some(Reverse((time, location, mut diff))) = self.worklist.pop() {

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

random whitespace

@@ -654,6 +685,7 @@ impl<T:Timestamp> Tracker<T> {
};
}
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

random whitespace

@frankmcsherry
Copy link
Member

Note to self: this would be great pointed at the new logging channel now introduced in #352.

utaal added a commit to utaal/timely-dataflow that referenced this pull request Jan 25, 2021
@frankmcsherry
Copy link
Member

Closing in favor of #375 which borrows heavily from this. If it turns out that there is an urgent need for e.g. logging the topology information, I can certainly make that happen too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants